5023B

Data Science for Biologists

Dr Philip Leftwich

About me

Associate Professor in Data Science and Genetics at the University of East Anglia.


Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.


Teach Genetics, Programming, and Statistics

UEA logo

Introductions

Outline of the course

  • Advanced linear modelling
  • Power analysis
  • Data reproducibility
  • R programming
  • Machine Learning

Expectations

  • One workshop per week

  • One lecture per week

  • One assignment per week

  • One ‘capstone’ project

What to expect during this course


I hope you end up with more questions than answers!


Schitts Creek questions gif

Source: giphy.com

Reproducible Research

What is Reproducibility?

Introduction to Reproducible Research

Turing Way Community cc-by licence

  • For Research to be reproducible both data and methods should be available.

  • Applying the described methods to the data leads to the same results

Methods

  • In theory, method availability ≠ code

  • But with complex data and analyses - are methods of data collection enough?

Self-correcting Science?

  • Science advances incrementally by identifying and rectifying errors over time

  • Peer review: Critical evaluation of papers by experts maintains quality

  • Independent studies either support or fail to replicate findings

Self-correcting Science?

  • Publication bias: preference for positive results

  • Pressure to publish

  • Poor study designs and statistical issues

  • Lack of transparency

Reproducibility trial:

246 biologists get different results from same data sets

Figure: Forest plots of meta-analytic estimated standardized effect sizes (Zr, blue triangles) and their 95% confidence intervals for each effect size included in the meta-analysis model. (A) Blue tit analyses: points where Zr is less than 0 indicate analyses that found a negative relationship between sibling number and nestling growth. (B) Eucalyptus analyses: points where Zr is less than 0 indicate a negative relationship between grass cover and Eucalyptus seedling success.

Reproducibility Crisis

  • The reproducibility crisis emerged when numerous studies, especially in fields like psychology, medicine, and biology, failed to be replicated by other researchers.

  • High-profile replication attempts revealed that many published results could not be consistently reproduced, raising doubts about their validity.

Crisis as an Opportunity

  • Recognition that no study should be considered ‘definitive’

  • Empower lasting systemic change through increased transparency in research methods, data sharing and reporting

  • Structural change in academic culture

Open Science

Open Science

Open science is a global movement that aims to make scientific research and its outcomes freely accessible to everyone. By fostering practices like data sharing and preregistration, open science not only accelerates scientific progress but also strengthens trust in research findings.

UKRN

  • UK Reproducibility Network - funded by UK Research Council

  • 46 member institutions (UEA is one)

  • Establish open research practices across UK Research

  • https://www.ukrn.org/

UKRN

Project management

Tidy projects

/home/phil/Documents/paper
├── abstract.R
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── figure.png
├── figure1.png
├── figure10.png
├── partial data.csv
├── script.R
└── script_final.R

Organised projects

  • README

  • Documented

  • Easy to code with

  • All files are inside the root folder

R projects

R projects

Slugs

  • A string of characters defining a file

What do you think are the contents of these files:

  • data/raw/madrid_minimum-temperature.csv

  • scripts/02_compute_mean-temperature.R

  • analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd

Name files

Come up with good names for these:

  • a dataset of cats with columns for weight, length, tail length, fur colour(s), fur type and name.

  • a script that downloads data from Spotify.

  • a script that cleans up data.

  • a script that fits a linear discriminant model and saves it to a file.

R projects and clean slates

R projects

  • Use projects

  • Check your code runs on blank slates

Quarto

  • Automates the creation of a paper or report

  • Saves time

  • Reduces errors

copy-paste

(https://www.nature.com/articles/d41586-022-00563-z)

Git


Git repository


Git collab


Forking


Renv


Benefits

Week Two: Descriptive Statistics

Building Statistical Models

  • What is a Statistical Model?

  • A model is a simplified representation of real-world processes.

  • It helps us describe, explain, and predict outcomes.

  • A good fit makes accurate predictions; a poor fit can lead to misleading conclusions.

  • To make reliable inferences, the model must accurately represent the data.

“Later, we’ll see how models help us test hypotheses using p-values to assess if the data fits our expectations.”

Populations and Samples

Population:

  • The entire group you want to study (e.g., all humans, all mice in a lab).

  • Studying the entire population is often impractical due to time, cost, or logistics.

Sample:

  • A subset of the population, selected to make inferences about the whole.

  • Must be representative to ensure accurate conclusions.

Descriptive Statistics

  • Summarize data to highlight key features:

  • Central Tendency: Where is the “center” of the data? (mean, median, mode)

  • Spread: How variable are the data? (variance, standard deviation)

  • Helps us understand the data before making inferences.

Central tendency

The central tendency of a series of observations is a measure of the “middle value”

The three most commonly reported measures of central tendency are the sample mean, median, and mode.

| Mean | Median | Mode |
| --- | --- | --- |
| The average value | The middle value | The most frequent value |
| Sum of the values divided by n | The middle value (if n is odd); the average of the two central values (if n is even) | The most frequent value |
| Most commonly reported measure; affected by outliers | Less influenced by outliers; improves as n increases | Less commonly reported |

Mean

One of the simplest statistical models in biology is the mean

| Lecturer | Friends |
| --- | --- |
| Mark | 5 |
| Tony | 3 |
| Becky | 3 |
| Ellen | 2 |
| Phil | 1 |
| Mean | 2.8 |
| Median | 3 |
| Mode | 3 |

Calculating the mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

\[ \bar{x} = \frac{5 + 3 + 3 + 2 + 1}{5} = \frac{14}{5} = 2.8 \]

Sum all the values and divide by the number of values (n)
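
The three measures of central tendency can be checked in R. This is a minimal sketch using the friends counts from the table; `stat_mode` is a hypothetical helper written for this example, because base R's `mode()` reports a storage type rather than the most frequent value.

```r
# Number of friends for Mark, Tony, Becky, Ellen and Phil (from the table)
friends <- c(5, 3, 3, 2, 1)

# Mean: sum of the values divided by the number of values
mean_by_hand <- sum(friends) / length(friends)
mean_builtin <- mean(friends)

# Median: the middle value once the data are sorted
median_friends <- median(friends)

# Mode: tabulate the values and take the most frequent one
# (stat_mode is a hypothetical helper, not a base R function)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
mode_friends <- stat_mode(friends)

print(c(mean = mean_builtin, median = median_friends, mode = mode_friends))
```

If several values tie for most frequent, this helper returns only the smallest of them, which is fine for a sketch but worth handling explicitly in real analyses.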

The mean

One of the simplest statistical models in biology is the mean

| Lecturer | Friends |
| --- | --- |
| Mark | 5 |
| Tony | 3 |
| Becky | 3 |
| Ellen | 2 |
| Phil | 1 |
| Mean | 2.8 |
| Median | 3 |
| Mode | 3 |

We already know this is a hypothetical value, as you can’t have 2.8 friends (I think?)

Now, with any model, we have to know how well it fits, that is, how accurate it is

Symmetrical

If data are symmetrically distributed, the mean and median will be close, especially as n increases.

The normal distribution is the best-known example of a symmetrical distribution

Variance & Standard Deviation

Variance:

The average of the squared differences from the mean.

\[ s^2_{sample} = \frac{\sum(x - \bar{x})^2}{n - 1} \]

Higher variance = more spread.

Standard Deviation (SD):

  • The square root of variance.

  • Easier to interpret because it’s in the same units as the data.

Calculating variance

Symmetrical sides

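| Lecturer | Friends | Residual | Squared residual |
| --- | --- | --- | --- |
| Mark | 5 | 2.2 | 4.84 |
| Tony | 3 | 0.2 | 0.04 |
| Becky | 3 | 0.2 | 0.04 |
| Ellen | 2 | -0.8 | 0.64 |
| Phil | 1 | -1.8 | 3.24 |
| Mean | 2.8 | | |

  • Sum of Squared Residuals = 8.8

  • n - 1 = 4

  • Variance = 8.8/4 = 2.2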
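
The same steps can be followed in R. This is a sketch that recomputes the residuals, their squares and the sample variance directly from the five friends counts, then compares the result with the built-in `var()`.

```r
# Number of friends for Mark, Tony, Becky, Ellen and Phil (from the table)
friends <- c(5, 3, 3, 2, 1)

# Residuals: the distance of each observation from the mean
residuals_friends <- friends - mean(friends)

# Squaring removes the sign, so positive and negative residuals cannot cancel
sum_of_squares <- sum(residuals_friends^2)

# Sample variance: sum of squared residuals divided by n - 1
variance_by_hand <- sum_of_squares / (length(friends) - 1)

# var() uses the same n - 1 denominator
variance_builtin <- var(friends)

print(round(residuals_friends, 2))
print(c(sum_of_squares = sum_of_squares,
        by_hand = variance_by_hand,
        builtin = variance_builtin))
```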

Standard Deviation

  • Square root of sample variance

  • A measure of dispersion of the sample

  • Smaller SD = more values closer to mean, larger SD = greater data spread from mean

  • Formula:

\[ s = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}} \]

N-1?

Sampling

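For a population, the variance \(\sigma^2\) is exactly the mean squared distance of the values from the population mean \(\mu\)

\[ \sigma^2 = \frac{\sum(x - \mu)^2}{N} \]

But applying this formula to a sample, with the sample mean \(\bar{x}\) in place of \(\mu\), gives a biased estimate of the population variance

  • A biased sample variance will underestimate population variance, because observations sit closer to their own sample mean than to the population mean

  • Dividing by n - 1 instead of n corrects for this; the correction matters most when the sample is small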
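
A small simulation can make the bias visible. This is a sketch under assumed values (a normal population with SD 2, samples of size 5, an arbitrary seed); `var_n` is a hypothetical helper that divides by n rather than n - 1.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

pop_sd <- 2      # assumed population SD, so the true variance is 4
n <- 5           # small samples make the bias easy to see
n_sims <- 10000  # number of simulated samples

# Variance with the n denominator (the population formula applied to a sample)
var_n <- function(x) sum((x - mean(x))^2) / length(x)

samples <- replicate(n_sims, rnorm(n, mean = 10, sd = pop_sd), simplify = FALSE)

mean_var_n  <- mean(sapply(samples, var_n))  # divides by n
mean_var_n1 <- mean(sapply(samples, var))    # var() divides by n - 1

print(c(true_variance = pop_sd^2,
        divide_by_n = mean_var_n,
        divide_by_n_minus_1 = mean_var_n1))
```

Averaged over many samples, the n version should settle near (n - 1)/n of the true variance, while the n - 1 version should settle near the true variance.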

More here

Standard deviation - our example

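| Lecturer | Friends | Diff | Squared diff |
| --- | --- | --- | --- |
| Mark | 5 | 2.2 | 4.84 |
| Tony | 3 | 0.2 | 0.04 |
| Becky | 3 | 0.2 | 0.04 |
| Ellen | 2 | -0.8 | 0.64 |
| Phil | 1 | -1.8 | 3.24 |
| Mean | 2.8 | | |
| Variance | 2.2 | | |
| SD | 1.48 | | |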

Standard Deviation

Small \(s\) = data points are clustered near the mean

Large \(s\) = data points are widely dispersed around the mean
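
To illustrate small versus large \(s\), the sketch below compares two made-up samples that share a mean but differ in spread, then applies `sd()` to the friends data. The `clustered` and `dispersed` vectors are hypothetical values chosen for the example.

```r
# Two hypothetical samples with the same mean (10) but different spread
clustered <- c(9, 10, 10, 10, 11)
dispersed <- c(2, 6, 10, 14, 18)

# sd() returns the square root of the n - 1 sample variance
print(c(mean = mean(clustered), sd = sd(clustered)))
print(c(mean = mean(dispersed), sd = sd(dispersed)))

# The same function applied to the lecturer friends data
friends <- c(5, 3, 3, 2, 1)
sd_friends <- sd(friends)
print(sd_friends)
```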

Why the Normal Distribution matters

Shape: Symmetrical, bell-shaped curve

  • Described by just two parameters: mean (μ) and standard deviation (σ).

  • Rule of Thumb: For normally distributed data:

    ~68% within 1 SD

    ~95% within 2 SDs

    ~99.7% within 3 SDs
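
The rule of thumb can be checked with the normal cumulative distribution function. This is a short sketch; `within_k_sd` is a hypothetical helper name.

```r
# Proportion of a normal distribution lying within k SDs of the mean.
# pnorm() is the cumulative distribution function of the standard normal.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
names(coverage) <- paste0("within_", 1:3, "_sd")

print(round(coverage, 4))
```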

Relevance: Many statistical tests assume data follows a normal distribution. This helps us calculate probabilities—like p-values—to test hypotheses.

Why the Normal Distribution matters

If we assume a normal distribution (or close enough), we can calculate the probability of observing a value in any given range using just the mean and standard deviation.

\[ f(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\!\Biggl(-\frac{(x - \mu)^2}{2\,\sigma^2}\Biggr) \]

This has applications in hypothesis testing and building confidence intervals
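
As a sketch, the density formula can be written out term by term and compared with R's built-in `dnorm()`. The mean of 10 and SD of 2 are assumed values for illustration, and `normal_density` is a hypothetical helper.

```r
# Normal probability density, written to mirror the formula above
normal_density <- function(x, mu, sigma) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

mu <- 10     # assumed mean
sigma <- 2   # assumed standard deviation
x <- c(6, 10, 13)

by_hand <- normal_density(x, mu, sigma)
builtin <- dnorm(x, mean = mu, sd = sigma)
print(rbind(by_hand, builtin))

# A density is a height, not a probability. Probabilities come from the
# area under the curve over a range, which pnorm() provides.
prob_8_to_12 <- pnorm(12, mu, sigma) - pnorm(8, mu, sigma)
print(prob_8_to_12)
```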

Visualising a Distribution

Histograms plot frequency/density of observations within bins

Quantile-Quantile plots plot quantiles of a dataset vs. quantiles of a theoretical (usually normal) distribution
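
Both plots are available in base R. The sketch below uses simulated data (an assumption, since no dataset is attached to this slide) drawn from a normal distribution, so the Q-Q points should fall close to the reference line.

```r
set.seed(1)  # arbitrary seed so the simulated data are repeatable

# Hypothetical measurements: 200 draws from a normal distribution
x <- rnorm(200, mean = 50, sd = 10)

# Histogram: counts of observations falling within each bin
h <- hist(x, breaks = 20, main = "Histogram of simulated data", xlab = "Value")

# Q-Q plot: sample quantiles against theoretical normal quantiles
qqnorm(x)
qqline(x)  # reference line; points near it suggest approximate normality
```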

Z-Scores and the Standard Normal Distribution

Problem: Different datasets have different means and standard deviations.

Solution: Standardization allows comparisons by converting any normal distribution into a standard normal distribution (mean = 0, SD = 1).

\[ Z = \frac{X - \mu}{\sigma} \]

  • Z: How many standard deviations a value is from the mean.

  • X: Observed value.

  • μ: Population mean.

  • σ: Population standard deviation.

Why do Z scores matter?

  • The standard normal distribution allows us to calculate probabilities of observing extreme values.

  • Example: If 𝑍 = 2, the observation is 2 standard deviations above the mean.

Using a Z-table, we can find the probability of getting a result at least this extreme.

This concept extends to p-values, which tell us how rare our data is under the null hypothesis.
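
In R, `pnorm()` plays the role of the Z-table. The sketch below uses assumed values (population mean 100, SD 15, observed value 130) chosen so that Z = 2.

```r
mu <- 100     # assumed population mean
sigma <- 15   # assumed population standard deviation
x <- 130      # hypothetical observed value

# Z-score: how many standard deviations x lies from the mean
z <- (x - mu) / sigma

# Probability of a value at least this far above the mean
p_upper <- pnorm(z, lower.tail = FALSE)

# Probability of a value at least this extreme in either direction
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(c(z = z, p_upper = p_upper, p_two_sided = p_two_sided))
```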

P-values

What is a P-value?

A p-value is the probability of obtaining results at least as extreme as the ones we observed, assuming the null hypothesis is true.

  • It helps quantify how surprising or unusual our data is under the null hypothesis.

  • A low p-value suggests that the observed data would be rare if the null hypothesis were true, which may lead us to question the null hypothesis

Example

Imagine flipping a coin 100 times, expecting about 50 heads if it’s fair (the null hypothesis). If you observe 90 heads, the p-value tells you how likely it is to get such an extreme result just by chance.
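
The coin example can be computed exactly from the binomial distribution. This is a sketch: it works out the two-sided p-value by hand and compares it with `binom.test()`.

```r
n_flips <- 100
n_heads <- 90

# Probability of 90 or more heads from a fair coin
p_upper <- pbinom(n_heads - 1, size = n_flips, prob = 0.5, lower.tail = FALSE)

# A fair coin is symmetric, so 10 or fewer heads is equally extreme
p_two_sided <- 2 * p_upper

# Exact binomial test of the null hypothesis that P(heads) = 0.5
coin_test <- binom.test(n_heads, n_flips, p = 0.5)

print(c(by_hand = p_two_sided, binom_test = coin_test$p.value))
```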

Hypothesis formation

  • Null hypothesis: Assumes there is no difference between groups or no relationship between variables. It represents the “status quo” or baseline expectation.

Example: A new drug has no effect compared to a placebo.

  • Alternative hypothesis: Assumes there is a difference or relationship.

Example: The new drug does improve patient outcomes compared to the placebo.

Null Hypothesis Probability Distribution

  • The shaded areas in the tails represent extreme outcomes (typically the most unexpected 5% if using an \(\alpha\) = 0.05).

  • If your observed data falls within these tails, it’s considered statistically significant, suggesting it’s unlikely to occur by random chance under the null hypothesis.

Impact of Sample Size on Distributions

  • Key Point: Larger sample sizes reduce the variability of sample estimates (the standard error), making it easier to detect small differences as statistically significant.

  • However, statistical significance doesn’t always mean practical importance, especially with large samples.
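
The effect of sample size can be sketched with the standard error of the mean, sigma / sqrt(n). The population SD of 10 is an assumed value, and the last column uses a simple normal approximation for the smallest difference from the null value that would reach p < 0.05.

```r
sigma <- 10                    # assumed population standard deviation
n <- c(10, 100, 1000, 10000)   # increasing sample sizes

# Standard error of the mean shrinks with the square root of n
se <- sigma / sqrt(n)

# Smallest mean difference reaching two-sided p < 0.05 under a normal
# approximation (about 1.96 standard errors)
smallest_significant_diff <- qnorm(0.975) * se

print(data.frame(n, se, smallest_significant_diff))
```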

Hypothesis testing

  • Scenario: Testing if diet A and diet B affect mice longevity.

  • Null Hypothesis (H₀): No difference in longevity between diets.

  • Alternative Hypothesis (H₁): There is a difference in longevity.

lm(longevity ~ diet, data = mice)

Explanation: With a two-level predictor, this linear model is equivalent to a two-sample t-test (assuming equal variances): it assesses whether the mean lifespans differ significantly between diets.
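
The link between the linear model and the t-test can be checked on simulated data. This is a sketch: the `mice` data frame below is made up (12 mice per diet, assumed group means and SD), not the course dataset.

```r
set.seed(123)  # arbitrary seed so the simulated data are repeatable

# Hypothetical data: 12 mice per diet, lifespans in years
mice <- data.frame(
  diet = factor(rep(c("mice_a", "mice_b"), each = 12)),
  longevity = c(rnorm(12, mean = 4.2, sd = 1), rnorm(12, mean = 2.8, sd = 1))
)

# Linear model: the intercept is the diet A mean,
# the dietmice_b coefficient is the difference (B minus A)
fit <- lm(longevity ~ diet, data = mice)
lm_row <- summary(fit)$coefficients["dietmice_b", ]

# Two-sample t-test with equal variances assumed
tt <- t.test(longevity ~ diet, data = mice, var.equal = TRUE)

print(lm_row)
print(confint(fit))
print(c(t_test_statistic = unname(tt$statistic), t_test_p = tt$p.value))
```

The t statistics have opposite signs because `t.test()` reports A minus B while the model coefficient is B minus A; their magnitudes and p-values agree.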

Visualising the T Distribution

  • The t-distribution shows where your observed mean difference falls relative to what’s expected under the null hypothesis.

  • Confidence Intervals (CI): The red area marks the 95% CI. If zero falls outside this range, it suggests statistical significance.

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     4.19     0.276     15.2  3.87e-13     3.62     4.77 
2 dietmice_b     -1.41     0.391     -3.60 1.60e- 3    -2.22    -0.595
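
The statistic, confidence interval and p-value in the dietmice_b row can be rebuilt from the estimate and standard error. This sketch assumes 22 residual degrees of freedom, as implied by the t22 notation used when the result is written up, and works from the rounded values printed in the table, so small rounding differences are expected.

```r
estimate  <- -1.41   # dietmice_b estimate from the table
std_error <- 0.391   # its standard error from the table
df <- 22             # assumed residual degrees of freedom

# t statistic: the estimate measured in standard errors
t_stat <- estimate / std_error

# 95% confidence interval: estimate plus or minus the critical t times the SE
t_crit <- qt(0.975, df)
ci <- estimate + c(-1, 1) * t_crit * std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, conf.low = ci[1], conf.high = ci[2], p = p_value))
```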

Writing About Statistical Results

  • “Diets can change the longevity of mice (p = 0.0016).”

  • “Mice on diet B lived significantly shorter lives than mice on diet A (t22 = -3.6, p = 0.0016).”

  • “Mice on diet B had a mean lifespan 1.41 years shorter (95% CI: 0.595 to 2.22) than mice on diet A (mean 4.19 years; 95% CI: 3.62 to 4.77). While statistically significant (t22 = -3.6, p = 0.0016), this is a relatively small sample size, and further testing is recommended to confirm this effect.”

  • Which has the greatest level of useful detail?

Common misunderstandings about P-values

A p-value is NOT:

  • The probability that the null hypothesis is true.

  • The probability your results occurred “by chance.”

  • Proof of a meaningful or large effect.

What it IS:

  • A measure of how surprising your data is under the assumption the null hypothesis is true.

Common misconceptions about P-values

Evidence of P-hacking

A non-significant p-value DOES NOT mean the null hypothesis is true

In reality, we can’t know for sure if a true mean difference exists.

For illustration: Assume we could know the true mean difference.

The figure shows:

Grey line: Expected data if the null hypothesis is true.

Black line: Expected data if the alternative hypothesis is true.

A p-value shows how surprising the data are if the null is true.

A low p-value is evidence against the null, not proof of the alternative.

Why a significant p-value does not mean the null hypothesis is false

What we can conclude, based on our data, is that we have observed an extreme outcome, that should be considered surprising. But such an outcome is not impossible when the null-hypothesis is true.
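
A simulation shows how often "surprising" outcomes occur when the null hypothesis is true. This is a sketch under assumed settings (two groups of 20 drawn from the same normal distribution, 5000 repeated experiments).

```r
set.seed(2024)  # arbitrary seed so the simulation is repeatable

n_sims <- 5000        # number of simulated experiments
n_per_group <- 20     # assumed group size

# Both groups come from the same population, so the null hypothesis is true
p_values <- replicate(n_sims, {
  group_a <- rnorm(n_per_group)
  group_b <- rnorm(n_per_group)
  t.test(group_a, group_b)$p.value
})

# Proportion of experiments declared significant at alpha = 0.05
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

Roughly 1 in 20 of these experiments should come out significant even though no true difference exists.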

Why a significant p-value is not the same as an important effect

If we plot the null model for a very large sample size, we can see that even very small mean differences will be considered ‘surprising’.

However, just because data is surprising, does not mean we need to care about it. It is mainly the verbal label ‘significant’ that causes confusion here – it is perhaps less confusing to think of a ‘significant’ effect as a ‘surprising’ effect.
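
The gap between "surprising" and "important" can be sketched with a very large simulated sample. The group size of 100,000 and the true difference of 0.05 standard deviations are assumed values chosen for illustration.

```r
set.seed(7)  # arbitrary seed so the simulation is repeatable

n <- 100000  # assumed (very large) sample size per group

# True difference between groups is tiny: 0.05 standard deviations
group_a <- rnorm(n, mean = 0,    sd = 1)
group_b <- rnorm(n, mean = 0.05, sd = 1)

tt <- t.test(group_a, group_b)

# Standardised effect size (Cohen's d) using the pooled SD
cohens_d <- (mean(group_b) - mean(group_a)) /
  sqrt((var(group_a) + var(group_b)) / 2)

print(c(p_value = tt$p.value, cohens_d = cohens_d))
```

The p-value is tiny, yet the standardised difference remains very small, so reporting the effect size alongside the p-value is what shows whether the result matters.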

Resources

Additional resources

  • Discovering Statistics - Andy Field

  • Happy Git

  • An Introduction to Generalized Linear Models - Dobson & Barnett

  • An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani

  • Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.

  • Ecological Statistics with contemporary theory and application

  • The Big Book of R (https://www.bigbookofr.com/)

  • British Ecological Society Guides to Better Science

  • SORTEE

Reading list
